Abstract

Searching the space of policies directly for the 
optimal policy has been one popular method for 
solving partially observable reinforcement learn- 
ing problems. Typically, with each change of the 
target policy, its value is estimated from the re- 
sults of following that very policy. This requires 
a large number of interactions with the environment
as different policies are considered. We
present a family of algorithms based on likeli- 
hood ratio estimation that use data gathered when 
executing one policy (or collection of policies) 
to estimate the value of a different policy. The 
algorithms combine estimation and optimization 
stages. The former utilizes experience to build 
a non-parametric representation of an optimized 
function. The latter performs optimization on 
this estimate. We show positive empirical results 
and provide a sample complexity bound.



1. Introduction 

Research in reinforcement learning focuses on designing 
algorithms for an agent interacting with an environment to 
adjust its behavior to optimize a long-term return. For envi- 
ronments which are fully-observable (i.e. the observations 
the agent makes contain all of the necessary information 
about the state of the environment), this problem can often 
be solved using a one-step look ahead analysis to formulate 
the solution as a dynamic programming problem. However, 
for the case of partially observable domains (i.e. the obser- 
vations are stochastic or incomplete representations of the 
environment's state), the perceptual aliasing of the obser- 
vations makes such methods infeasible. 

One viable approach is to search directly in a parameter- 
ized space of policies for a local optimum. Following 
Williams's REINFORCE algorithm (?), searching by gradient
descent has been considered for a variety of policy
classes (?; ?; ?; ?; ?). A commonly recognized short- 
coming of all these variations on gradient descent policy 
search is that they require a very large number of samples 
(instances of agent-environment interaction) to converge. 

This inefficiency arises because the value of the policy (or 
its derivative) is estimated by sampling from the returns 
obtained by following that same policy. Thus, after one 
policy is evaluated and a new one proposed, the samples 
taken from the old policy must be discarded. Each new 
step of the policy search algorithm requires a new set of 
samples. The key to solving this inefficiency is to use data 
gathered when using one policy to estimate the value of 
another policy. The method known as "likelihood ratio"
estimation enables this data reuse. 

Stochastic gradient methods and likelihood ratios have 
been long used for optimization problems (see work of (?; 
?; ?; ?)). Recently, stochastic gradient descent methods, 
in particular REINFORCE (?; ?), have been used in con- 
junction with policy classes constrained in various ways: 
with external memory (?), finite state controllers (?) and 
in multi-agent settings (?). The idea of using likelihood ra- 
tios in reinforcement learning was suggested by ? (?) and 
developed for solving MDPs with function approximation 
by ? (?) and for gradient descent in finite state controllers 
by ? (?). However, only on-line optimization was considered.
(?; ?) developed a greedy algorithm for combining
samples from multiple policies in normalized estimators 
and demonstrated a dramatic improvement in performance. 
? (?) showed that likelihood-ratio estimation enables the 
application of methods from statistical learning theory to 
derive PAC bounds on sample complexity. 

? (?) provide a method for estimating the return of ev- 
ery policy simultaneously using data gathered while exe- 
cuting a fixed policy without the use of likelihood ratios. 
In some domains, there is a natural distance between observations
and actions which also allows one to re-use experience
without likelihood ratio estimation. ? (?) demonstrate
algorithms for kernel-based RL in one such domain:
financial planning and investments.

This paper extends our previous work by presenting a generalized
method of using likelihood ratio estimation in policy search
and investigating the performance of this method under
different conditions on illustrative examples. With this
publication we hope to stimulate a dialog between the
communities of reinforcement learning and computational
learning theory. We present a clear outline of all algorithms
in the hope of attracting a wider research community to applying
these algorithms in various domains. We also present some
new bounds on the sample complexity of these algorithms
and attempt to relate these results to empirical results.
We begin this paper with a brief definition of reinforcement
learning and sampling in order to clarify our notation.
Then we present our algorithm and consider the question of
how to sample. Finally, we consider the question of how much
to sample and present a PAC-style bound as a quantitative answer.

2. Background 

We introduce the environment model and importance sam- 
pling in a single mathematical notation. In particular, we 
keep the standard notation for partially observable Markov 
decision processes and modify the sampling notation to be
consistent. 

2.1 Environment Model 

The class of problems we consider can be described by
the partially observable Markov decision process (POMDP)
model. In a POMDP, a sequence of events occurs at each
time step: the agent observes an observation $o(t) \in O$
dependent on the state of the environment $s(t) \in S$; it performs
an action $a(t) \in A$ according to its policy, inducing a state
transition of the environment; then it receives a reward $r(t)$
based on the action taken and the environment's state. A
POMDP is defined by four probability distributions (and the
spaces over which those distributions are defined): a distribution
over starting states, a distribution over observations
conditioned on the state, a distribution over next states
conditioned on the current state and the agent's action, and a
distribution over rewards given the state and action. These
distributions, specifying the dynamics of the environment,
are unknown to the agent along with the state space of the
process, $S$.

Let $H_t = \{(o(1), a(1), r(1), \ldots, o(t), a(t), r(t), o(t+1))\}$
denote the set of all possible experience sequences of
length $t$. Generally speaking, in a POMDP, a policy $\pi$
is a function specifying the action to perform at each
time step as a function of the whole previous history:
$\pi : H \to \mathcal{P}(A)$. This function is parameterized by a vector
$\theta \in \Theta$. The policy class $\Theta$ is the set of policies realizable by
all parameter settings. We assume that the probability
of the elementary event is bounded away from zero:
$0 < \underline{c} \le \Pr(a \mid h, \theta) \le \overline{c} < 1$, for any $a \in A$, $h \in H$, and
$\theta \in \Theta$. A history $h$ includes several immediate rewards
$(r(1), \ldots, r(i), \ldots)$ that are typically summed to form a
return, $R(h)$, but our results are independent of the method
used to compute the return.

Together with the distributions defined by the POMDP, any
policy $\theta \in \Theta$ defines a conditional distribution $\Pr(h \mid \theta)$ on
the class of all histories $H$. The value of policy $\theta$ is the
expected return according to the probability induced by
this policy on the history space:
$V(\theta) = E_\theta[R(h)] = \sum_{h \in H} [R(h) \Pr(h \mid \theta)]$,
where $E_\theta$ stands for $E_{\Pr(h \mid \theta)}$. We
assume that policy values (and returns) are non-negative
and bounded by $V_{max}$. The objective of the agent is to find
a policy $\theta^*$ with optimal value: $\theta^* = \arg\max_\theta V(\theta)$.
Because the agent does not have a model of the environment's
dynamics or reward function, it cannot calculate $\Pr(h \mid \theta)$
and must estimate it via sampling.

2.2 Sampling 

If we wish to estimate the value $V(\theta)$ of the policy
$\theta$, we may draw sample histories from the distribution
induced by this policy by executing the policy multiple
times in the environment. After taking $N$ samples
$h = \{h_i\}$, $i \in (1, \ldots, N)$, we can use the unbiased estimator:

$$\hat{V}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R(h_i) .$$

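Imagine, however, that we are unable to sample from the
policy $\theta$ directly, but instead have samples from another
policy $\theta'$. The intuition is that if we knew how "similar"
those two policies were to one another, we could use samples
drawn according to the distribution $\theta'$ and make an
adjustment proportional to the similarity of the policies.
Formally we have:

$$V(\theta) = \sum_{h \in H} R(h) \Pr(h \mid \theta) = \sum_{h \in H} R(h) \frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')} \Pr(h \mid \theta') = E_{\theta'}\left[ R(h) \frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')} \right] .$$
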
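Now we can construct an unbiased indirect estimator from
samples $h_i$ drawn according to the distribution $\Pr(h \mid \theta')$,
which is called an importance sampling estimator (?)
$\hat{V}^{IS}_{\theta'}(\theta)$ of $V(\theta)$:

$$\hat{V}^{IS}_{\theta'}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R(h_i) \frac{\Pr(h_i \mid \theta)}{\Pr(h_i \mid \theta')} .$$
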
We can normalize the importance sampling estimate to obtain
a lower variance estimate at the cost of adding bias.
Such an estimator is called a weighted importance sampling
estimator and has the form

$$\hat{V}^{WIS}_{\theta'}(\theta) = \frac{\sum_{i=1}^{N} R(h_i) \frac{\Pr(h_i \mid \theta)}{\Pr(h_i \mid \theta')}}{\sum_{i=1}^{N} \frac{\Pr(h_i \mid \theta)}{\Pr(h_i \mid \theta')}} ,$$

which has been found to be better-behaved than $\hat{V}^{IS}_{\theta'}(\theta)$
both theoretically and empirically (?; ?; ?).
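
As a minimal sketch of the two estimators just described, the following Python computes the importance sampling and weighted importance sampling estimates from a list of returns and a list of likelihood ratios. The function names and the toy numbers are illustrative assumptions, not part of the original presentation.

```python
def is_estimate(returns, weights):
    """Importance sampling: average of R(h_i) * w_i over the N samples."""
    return sum(r * w for r, w in zip(returns, weights)) / len(returns)


def wis_estimate(returns, weights):
    """Weighted importance sampling: sum(R_i * w_i) / sum(w_i)."""
    return sum(r * w for r, w in zip(returns, weights)) / sum(weights)


# Toy data (assumed for illustration): three sampled returns and the
# likelihood ratios Pr(h_i | theta) / Pr(h_i | theta') of their histories.
returns = [1.0, 0.0, 2.0]
weights = [0.5, 2.0, 1.0]

v_is = is_estimate(returns, weights)    # (0.5 + 0.0 + 2.0) / 3
v_wis = wis_estimate(returns, weights)  # (0.5 + 0.0 + 2.0) / 3.5
```

Note that the normalized estimate always lies between the smallest and largest observed return, which is one intuition for its lower variance.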

Note that both estimators contain the quantity
$\frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')}$, a ratio of likelihoods. The key
observation for the remainder of this paper is that while an
agent is not assumed to have a model of the environment and
therefore is not able to calculate $\Pr(h \mid \theta)$, it is able to
calculate the likelihood ratio $\frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')}$
for any two policies $\theta$ and $\theta'$ (?; ?; ?).
$\Pr(h \mid \theta)$ can be written as a product of $\psi(h)$ and $\phi(h \mid \theta)$,
where $\phi(h \mid \theta) = \prod_{t=1}^{T} \Pr(a(t) \mid o(t), \ldots, o(1), \theta)$ is the
contribution of all of the agent's actions to the likelihood of
the history, and $\psi(h)$ is the contribution of environmental
events. Because the component $\psi(h)$ is independent of the
policy (i.e. it does not depend on the policy parameter, only
on the history and the POMDP distributions), it cancels from
the ratio, and we have

$$\frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')} = \frac{\psi(h)\,\phi(h \mid \theta)}{\psi(h)\,\phi(h \mid \theta')} = \frac{\phi(h \mid \theta)}{\phi(h \mid \theta')} .$$

$\phi(h \mid \theta)$ depends
only on the agent, the observations, and the actions (not the
states), is known to the agent, and can be computed and
differentiated. This allows us to construct more efficient
learning algorithms that can take advantage of past experience.

Finally, if the sampling distribution is not constant (i.e.
each sample is drawn from a different distribution), a single
unbiased importance sampling estimator can be constructed
by using all of the samples where the assumed single
sampling distribution is the mixture of the true sampling
distributions. Thus, if samples were taken according
to policies $\theta_1, \theta_2, \ldots, \theta_n$, $\Pr(h \mid \theta')$ from above is replaced
with $\frac{1}{n} \sum_{j} \Pr(h \mid \theta_j)$. ? (?) gives more details for importance
sampling estimators with independent, but not identically
drawn, samples. Using this new estimator allows us to
change policies (sampling distributions) during sampling.

3. Algorithms 

Consider constructing a proxy environment that contains a
non-parametric model of the values of all policies as illustrated
by Figure 1. This model is a result of trying several
policies $\theta_1 \ldots \theta_N$. Given an arbitrary new policy $\theta \in \Theta$,
the proxy environment returns an estimate of its value $V(\theta)$
as if the policy were tried in the real environment. Assuming
that obtaining a sample from the environment is costly,
we want to construct the proxy module based only on a
small number of queries about policies $\{\theta_i\}$, $i = 1 \ldots N$,
that return values $R_i$. These queries are implemented by
the sample routine (Table 1). After getting $N$ samples,
it requires memory of size $O(NT(\log|O| + \log|A|))$ to
store the data, where $T$ is the length of a trial and $O$ and $A$


Table 1. The sample routine accepts a policy parameter setting 
and outputs the return and history from one sample of the policy. 
Note that in many cases the entire history need not be returned. 

Input: policy $\theta$
Init:
    $h \leftarrow ()$, $R \leftarrow 0$
Get initial observation $o_0$.
For each time step $t$ of the trial:
    Draw next $a_t$ from $\pi(o_t, a_t, \theta)$
    Execute $a_t$.
    Get observation $o_t$, reward $r_t$.
    $R \leftarrow R + r_t$
    $h \leftarrow$ concatenate($h$, $(o_t, a_t)$)
Output: experience $(R, h)$



are the sets of possible observations and actions respectively.
However, for many policy classes this memory requirement
can be reduced. For example, if the policy class is reactive
(conditioned on the current observation, the probability of
the current action has no dependence on the past), the history
can be summarized sufficiently by the counts of the
number of times each action was chosen after each observation.
This requires memory of size $O(N \log(T) |O| |A|)$.
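
The count-based summary for reactive policies can be sketched as follows. This is only an illustration under the assumption that a history is stored as a sequence of (observation, action) pairs; the function name and the toy history are hypothetical.

```python
from collections import Counter


def summarize_history(history):
    """Reduce a history to counts n[(o, a)] of how often a followed o.

    For a reactive policy these counts are all that is needed to
    evaluate the policy's factor of the history likelihood.
    """
    return Counter(history)


# Toy history (assumed): four (observation, action) pairs.
history = [("left", "go"), ("right", "stay"), ("left", "go"), ("left", "stay")]
counts = summarize_history(history)
```

Each count is at most the trial length, which is where the $\log(T)$ factor in the memory requirement comes from.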







Figure 1. A diagram of the policy evaluation process. The sam- 
pling process is costly and therefore only performed a limited 
number of times. The proxy collects the samples from the en- 
vironment and constructs an agent-centric model that predicts the 
effects of hypothetical agent policies. The agent learns by inter- 
acting with the proxy. 

This proxy can be queried by the learning algorithm as
shown in Table 2. In response to a policy parameter setting,
the routine evaluate returns its estimate of the expected
return and its derivative. The algorithm shown in Table 2
computes the weighted importance sampling estimate. For



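Table 2. The evaluate routine computes the proxy's estimate
of the value of a policy and its derivative. The inner loop (over j)
can be removed with caching, making the routine faster.

Input: policy $\theta$, data $D = \{(\theta_i, R_i, h_i)\}$ for $i \in \{1 \ldots N\}$
Init: $V \leftarrow 0$, $\Delta V \leftarrow 0$, $K \leftarrow 0$
For $i = 1$ to $N$:
    Init: $\Phi'_i \leftarrow 0$
    For $j = 1$ to $N$:
        $\Phi'_i \leftarrow \Phi'_i + \phi(h_i \mid \theta_j)$
    $K_i \leftarrow \phi(h_i \mid \theta) / \Phi'_i$
    $K \leftarrow K + K_i$
For $i = 1$ to $N$:
    $V \leftarrow V + R_i K_i / K$
For $i = 1$ to $N$:
    $\Delta V \leftarrow \Delta V + (R_i - V) \frac{\partial \phi(h_i \mid \theta)}{\partial \theta} / \Phi'_i$
$\Delta V \leftarrow \Delta V / K$
Output: proxy evaluation $V$ and derivative $\Delta V$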

simplicity, the inner loop (over $j$) is shown; in practice, the
computations in this loop do not need to be redone for every
evaluation. Using memory of size $O(N)$, the values $\Phi'$ can
be computed ahead of time (in constant time per sample),
thus reducing the evaluation to $O(N)$ time.
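
A sketch of the cached evaluation described above, assuming a scalar policy parameter and user-supplied callables `phi(h, theta)` and `dphi(h, theta)` for the policy's likelihood factor and its derivative. These names, the data layout, and the one-step bandit example are assumptions made for illustration; the weighted importance sampling estimate uses the mixture of all sampling policies as the denominator.

```python
def precompute_denominators(data, phi):
    """Cache, for each stored history h_i, the sum over j of phi(h_i | theta_j)."""
    thetas = [theta for theta, _, _ in data]
    return [sum(phi(h, th) for th in thetas) for _, _, h in data]


def evaluate(theta, data, denoms, phi, dphi):
    """Return the WIS value estimate at theta and its derivative in O(N) time."""
    k = [phi(h, theta) / d for (_, _, h), d in zip(data, denoms)]
    dk = [dphi(h, theta) / d for (_, _, h), d in zip(data, denoms)]
    total = sum(k)
    value = sum(r * ki for (_, r, _), ki in zip(data, k)) / total
    # Quotient rule applied to sum(R_i k_i) / sum(k_i).
    grad = sum((r - value) * dki for (_, r, _), dki in zip(data, dk)) / total
    return value, grad


# One-step, two-action example (assumed): theta is the probability of
# action 0, and a history is just the index of the action taken.
def phi(h, theta):
    return theta if h == 0 else 1.0 - theta


def dphi(h, theta):
    return 1.0 if h == 0 else -1.0


data = [(0.5, 1.0, 0), (0.5, 0.0, 1)]  # (theta_i, R_i, h_i)
denoms = precompute_denominators(data, phi)
value, grad = evaluate(0.8, data, denoms, phi, dphi)
```

In this toy case action 0 returned 1 and action 1 returned 0, so the estimated value of a policy equals its probability of choosing action 0 and the derivative is 1.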

The evaluate routine relies on two other routines: one
to calculate $\phi(h \mid \theta)$ and one to calculate the derivative of
$\phi(h \mid \theta)$. Recall that $\phi(h \mid \theta)$ is the policy's factor of the
probability of the history $h$. As an example, if we assume
the policy to be reactive, the parameter $\theta_{o,a}$ to be the
probability of selecting action $a$ after observing $o$, and $n_{o,a}$ to
be the count of the number of times action $a$ was chosen
after observing $o$ during the history,

$$\phi(h \mid \theta) = \prod_{o,a} (\theta_{o,a})^{n_{o,a}}, \qquad \frac{\partial \phi(h \mid \theta)}{\partial \theta_{o,a}} = \frac{n_{o,a}}{\theta_{o,a}}\, \phi(h \mid \theta) .$$


? (?); ? (?) describe how to compute these quantities for 
reactive policies with Boltzmann distributions and ? (?) 
describes how to compute these quantities for finite-state 
controllers. 
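
For the directly parameterized reactive policy, the two helper routines can be sketched as below. The sketch assumes the counts are stored in a dictionary keyed by (observation, action) and that each $\theta_{o,a}$ is treated as a free parameter when differentiating (the normalization constraint across actions is ignored); all names and numbers are illustrative.

```python
import math


def phi(counts, theta):
    """Policy factor of the history likelihood: prod theta[o][a] ** n[(o, a)]."""
    return math.prod(theta[o][a] ** n for (o, a), n in counts.items())


def dphi(counts, theta, o, a):
    """Partial derivative of phi with respect to theta[o][a]."""
    n = counts.get((o, a), 0)
    return n * phi(counts, theta) / theta[o][a]


# Toy example (assumed): one observation "x", two actions "l" and "r".
counts = {("x", "l"): 2, ("x", "r"): 1}
theta = {"x": {"l": 0.6, "r": 0.4}}

likelihood = phi(counts, theta)           # 0.6**2 * 0.4
slope = dphi(counts, theta, "x", "l")     # 2 * likelihood / 0.6
```

For long histories the product underflows quickly, so a practical implementation would likely work with log-likelihoods instead.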

Any policy search algorithm can now be combined with
this proxy environment to learn from scarce experience.
Table 3 shows a general reinforcement learning algorithm
family using the proxy. The definitions of pick_sample,
add_data, and optimize are crucial to the behavior
of the algorithm. The REINFORCE algorithm (?) is
one particular instantiation of the learn routine where
pick_sample returns $\theta^*$ without consulting the data,



Table 3. The learn routine accepts the number of trials it is
allowed and returns its guess at the optimal policy. It relies on four
external routines: pick_sample, which selects a policy to sample
given the data and the current best guess; sample, as shown
in Table 1; add_data, which adds the new data point to the data
collected so far; and optimize, which performs some form of
optimization on the proxy evaluation function.

Input: number of samples/trials $N$
Init: $D \leftarrow ()$, $\theta^* \leftarrow$ random policy
For $i = 1$ to $N$:
    $\theta \leftarrow$ pick_sample($D$, $\theta^*$)
    $(R, h) \leftarrow$ sample($\theta$)
    $D \leftarrow$ add_data($D$, $(\theta, R, h)$)
    $\theta^* \leftarrow$ optimize($D$, $\theta^*$)
Output: hypothetical optimal policy $\theta^*$



add_data forgets all of the previous data and replaces it
with the most recent sample, and optimize performs one
step of gradient descent (using $\hat{V}^{IS}$ instead of $\hat{V}^{WIS}$). The
exploration extension to REINFORCE proposed by ? (?) is
exactly the same except the pick_sample routine now
returns a policy that is a mixture of $\theta^*$ and a random policy.

In order to make effective use of all of the data, we define 
add_data to append the new data sample to the collection 
of data. This allows our algorithm to remember all previ- 
ous experience. Additionally, we use an optimize rou- 
tine that performs full optimization (not just a single step). 
In REINFORCE and other policy search methods, the cur- 
rent policy guess embodies all of the known information 
about the past (forgotten) samples. It is therefore important 
to only take small steps of decreasing size to ensure the algorithm
converges. Because we now remember all of the
previous samples and we do not have any restraint on which 
policy we must use for the next sample, we can search for 
the true optimum of the estimator at every step. 
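
The learn loop and the two data-retention choices discussed above can be sketched as follows. The four routines are passed in as callables; their signatures, and the helper names `append_data` and `keep_latest`, are assumptions chosen to mirror the description rather than a definitive implementation.

```python
def learn(n_trials, theta_init, pick_sample, sample, add_data, optimize):
    """Generic policy search loop built around the proxy evaluator."""
    data = []
    theta_star = theta_init
    for _ in range(n_trials):
        theta = pick_sample(data, theta_star)      # which policy to try next
        ret, history = sample(theta)               # one costly real trial
        data = add_data(data, (theta, ret, history))
        theta_star = optimize(data, theta_star)    # search on the proxy
    return theta_star, data


def append_data(data, point):
    """Remember all experience (the choice made in this paper)."""
    return data + [point]


def keep_latest(data, point):
    """Forget everything but the newest sample (REINFORCE-style)."""
    return [point]
```

Swapping `append_data` for `keep_latest`, and a full optimizer for a single gradient step, recovers the memoryless behavior attributed to REINFORCE above.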

4. How to Sample? 

We still are left with a choice for the routine 
pick_sample. This routine represents our balance be- 
tween exploration and exploitation. For this paper, we will 
consider a simple possibility to illustrate this trade-off. We 
let the pick_sample routine have a single parameter $p^*$.
pick_sample is stochastic and with probability $p^*$ it returns
$\theta^*$. The remainder of the time it returns a random
policy chosen uniformly over the space $\Theta$. Thus, the larger
the value of $p^*$, the more exploitative the algorithm is.
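
A minimal sketch of this one-parameter sampling rule is given below. The factory function, the `sample_random_policy` callable, and the uniform draw over an interval (matching a one-parameter policy space) are assumptions introduced for illustration.

```python
import random


def make_pick_sample(p_star, sample_random_policy, rng=random):
    """Build a pick_sample routine that exploits with probability p_star."""
    def pick_sample(data, theta_star):
        if rng.random() < p_star:
            return theta_star               # exploit the current best guess
        return sample_random_policy(rng)    # explore uniformly
    return pick_sample


def uniform_interval_policy(rng):
    """Hypothetical one-parameter policy space: a probability in [0.1, 0.9]."""
    return rng.uniform(0.1, 0.9)


greedy = make_pick_sample(1.0, uniform_interval_policy)  # always exploits
blind = make_pick_sample(0.0, uniform_interval_policy)   # always explores
```

Intermediate values of `p_star` interpolate between these two extremes.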
Figure 2. Results for the bandit problems. Top: "HT": hidden
treasure. Bottom: "HF": hidden failure. Plotted is the expected
return of the resulting policy against the number of samples taken,
$N$, and the probability of exploiting, $p^*$.



4.1 Illustration: Bandit Problems 

Let us consider a trivial example of a bandit problem to
illustrate the importance of exploitation and exploration. The
environment has a degenerate state space of one state, in
which two actions, $a_1$ and $a_2$, are available. The space of
policies available is stochastic and encoded with one parameter,
the probability of taking the first action, which
is constrained to be in the interval $[\underline{c}, \overline{c}] = [.1, .9]$. We
consider two problems, called "HT" (hidden treasure) and
"HF" (hidden failure), both of which have the same expected
returns for actions: 1 for $a_1$ and 0 for $a_2$. In HF, $a_1$
always returns 1, while $a_2$ returns 10 with probability .99
and $-990$ with probability .01. In HT, $a_2$ always returns 0,
while $a_1$ returns $-10$ with probability .99 and $+1090$ with
probability .01. We would expect a greedy learning algorithm
to sample near policies that look better under scarce
information, tending to choose the sub-optimal $a_2$ in the
HT problem. This strategy is inferior to blind sampling,
which samples uniformly from the policy space and will
discover the hidden treasure of $a_1$ faster. By contrast, for
the HF problem we would expect the greedy algorithm to
do better by initially concentrating on $a_2$, which looks
better, and discovering the hidden failure sooner than blind
sampling.
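
The arithmetic behind the two problems can be checked with a short sketch. The dictionary layout and function names are assumptions; the probabilities and payoffs are the ones stated above.

```python
# Each action maps to a list of (probability, return) outcomes.
HF = {"a1": [(1.0, 1.0)], "a2": [(0.99, 10.0), (0.01, -990.0)]}
HT = {"a1": [(0.99, -10.0), (0.01, 1090.0)], "a2": [(1.0, 0.0)]}


def expected_return(outcomes):
    """Expected return of one action."""
    return sum(p * r for p, r in outcomes)


def policy_value(problem, theta):
    """True value of the policy that takes a1 with probability theta."""
    return (theta * expected_return(problem["a1"])
            + (1.0 - theta) * expected_return(problem["a2"]))
```

In both problems the first action has expected return 1 and the second 0, so the true value of a policy equals its probability of taking the first action and the best admissible policy is the one at the upper end of the interval.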

We ran the learn algorithm from Table 3 for different
settings of the parameters $N$ and $p^*$. Figure 2 shows the true
value of the resulting policies, averaged over 1000 runs of
the algorithm. While the plots may look discouraging,
remember that these problems are in some ways a worst-case
situation. The true value of the actions only becomes apparent
after sampling on the order of 100 times. The plots support
our hypothesis about the relative success of exploitation.
However, although acting greedily is somewhat better
in HF, it is much worse in HT. This illustrates why, without
any prior knowledge of the domain and given a limited
number of samples, it is important not to guide sampling
too much by optimization.

5. How Much to Sample? 

If we wish to guarantee with probability $1 - \delta$ that the error
in the estimate of the value function is less than $\epsilon$, we
can derive bounds on the necessary sample size $N$, which
depend on $\delta$, $\epsilon$, $V_{max}$, and the complexity of the hypothesis
class expressed by the covering number $\mathcal{N}$. Our new
result is an extension of the sample complexity bound for
the IS estimator (?) to the WIS estimator. We only quote
the results here. The key point in the derivation is the fact
that



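$$\sup_{h, h' \in H} \left| \hat{V}^{WIS}(h) - \hat{V}^{WIS}(h') \right| \le \cdots ,$$

where $h'$ differs from $h$ only by one member trajectory
$h_i$. Two inequalities follow from this fact. Denote
$\eta = \max(c^{-T}, (1 - c)^{-T})$. The variance of the WIS estimator
according to Devroye's theorem is bounded as

$$\mathrm{Var}\left[ \hat{V}^{WIS} \right] \le \frac{\cdots}{4 (N + \eta^2)^2} ,$$

and McDiarmid's theorem (?) gives us a PAC bound (for
derivation see ? (?)):

$$\Pr\left( \sup_{\theta \in \Theta} \left| \hat{V}^{WIS}(\theta) - E\left[ \hat{V}^{WIS}(\theta) \right] \right| > \epsilon \right) \le 4 \mathcal{N}(\epsilon, 1) \exp\left( - \frac{\epsilon^2 (N + \eta^2)^2}{\cdots\, \eta^4 N} \right) ,$$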



which gives a sample complexity bound very similar to
those obtained by ? (?) (see next section). It is well
known (?) that the variance of the WIS estimate is $O(\frac{1}{N})$. The
weak dependence on the horizon $T$ is interesting and in
accordance with empirical findings. The covering number
is defined through the value $V(\theta)$ and describes the
complexity of a policy class (e.g. reactive policies or finite state


Table 4. Comparison of sample complexity bounds.

Algorithm               Lower bound on sample complexity $N$
likelihood ratio        …
reusable trajectories   $\left(\frac{V_{max}}{\epsilon}\right)^2 2^{2T}\, VC(\Theta) \left(T + \log\frac{V_{max}}{\epsilon} + \log(1/\delta)\right) \log(T)$



controllers) with respect to the structure of a reward function.

5.1 Comparison to VC Bound

The pioneering work by ? (?) considers the issue of gen- 
erating enough information to determine the near-best pol- 
icy. We compare our sample complexity results from above 
with a similar result for their "reusable trajectories" algo- 
rithm. Using a random policy (selecting actions uniformly 
at random), reusable trajectories generates a set of history 
trees. This information is used to define estimates that uni- 
formly converge to the true values. The algorithm relies on 
having a generative model of the environment, which al- 
lows simulation of a reset of the environment to any state 
and the execution of any action to sample an immediate re- 
ward. The reuse of information is partial: the estimate of a 
policy value is built only on the subset of experiences that 
are "consistent" with the estimated policy. 

We will make a comparison based on a sampling policy that
selects one of two actions uniformly at random: $\Pr(a \mid h) = \frac{1}{2}$.
For the horizon $T$, this gives us an upper bound $\eta$ on the
likelihood ratio:

$$w_\theta(h, \theta') \le 2^T (1 - c)^T = \eta .$$

Substituting this expression for $\eta$, we can compare our
bounds to the bound of ? (?) as presented in Table 4. The
metric entropy $\mathcal{K}(\Theta)$ takes the place of the VC dimension
$VC(\Theta)$ in terms of policy class complexity. Metric entropy
is a more refined measure of capacity than VC dimension;
the VC dimension is an upper bound on the growth function,
which is an upper bound on the metric entropy (?).

5.2 Illustration: Load-Unload Problem 

The complexity of the problem is measured by the covering
number, $\mathcal{N}$. It encodes the complexity of the combination
of the POMDP and the policy class. We use the load-unload
problem of Figure 3 to illustrate the effect of policy
complexity. The agent is a cart designed to shuttle loads 
back and forth between two end-points on a line. The cart 
does not have sensors to indicate whether it is loaded or 
unloaded, but it can determine its position on the line. The 
optimal policy is one where the cart moves back and forth 
between the leftmost and rightmost states moving as many 
Figure 3. Diagram of the "load-unload" world. The agent observes
its position (box) but not whether the cart is loaded (node
within the box). The cart loads in the left-most state. If it reaches 
the right-most position while loaded (upper path), it unloads and 
gets a unit of reward. The agent has a choice of moving left or 
right at each position. Each trial begins in the "load" state and 
lasts 100 steps. The optimal controller requires one bit of mem- 
ory. 



loads as possible. To do so requires some form of memory. 
For our example, we will use finite-state controllers with 
fixed memory sizes (?; ?). 

A finite-state controller is a class of policies with a fixed 
memory size. The controller has its own internal memory 
state that is restricted to one of a finite number of values. At 
each time step, it not only selects which action to take, but 
also a memory state for the next time step. The controller's 
choice of action and next memory state are independent of 
the past given the current observation and memory state. 
This model is an extension of the reactive policy class to 
allow the controller to remember a small amount about the 
past. Finite-state controllers have the capability of remem- 
bering information for an arbitrarily long period of time. 
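
One step of such a controller can be sketched as below. The sketch assumes the action and the next memory state are each drawn from a table indexed by the current (memory, observation) pair and, as a simplifying assumption not stated in the text, that the two draws are independent of each other; the table layout and names are hypothetical.

```python
import random


def fsc_step(memory, observation, action_probs, memory_probs, rng=random):
    """Sample (action, next_memory) given the current memory and observation."""
    a_dist = action_probs[(memory, observation)]
    m_dist = memory_probs[(memory, observation)]
    action = rng.choices(list(a_dist), weights=list(a_dist.values()))[0]
    next_memory = rng.choices(list(m_dist), weights=list(m_dist.values()))[0]
    return action, next_memory


# Deterministic fragment (assumed) for a load-unload style task: at the left
# end with memory 0, move right and switch memory to 1 ("carrying a load").
action_probs = {(0, "left_end"): {"right": 1.0}}
memory_probs = {(0, "left_end"): {1: 1.0}}
step = fsc_step(0, "left_end", action_probs, memory_probs)
```

A reactive policy is the special case with a single memory state, in which the memory table carries no information.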

Figure 4 demonstrates the effect of policy complexity on
the performance of the algorithm. This plot is the same as
the ones in Figure 2 except that the exploitation probability,
$p^*$, has been fixed at 0.5. The four lines depict results
for different policy classes, $\Theta$. The solid line is for reactive
policies (policies with no memory) whereas the dashed
and dotted lines are for finite-state controllers with varying
amounts of memory. Only one bit of memory is required
to perform optimally in this environment. Using more than
two states of memory is superfluous. We can see that the
simpler the policy class, the more quickly the algorithm
converges. However, with too simple a policy class (i.e.
reactive policies), the convergence is to a suboptimal policy.
For comparison, the thin dashed line presents the behaviour
of REINFORCE with a two-internal-state controller.
As we have seen, REINFORCE forgets past experience and
improves very slowly as the amount of experience grows.

Figure 4. The expected return of the policy found by the algorithm
as a function of the number of samples for the environment
shown in Figure 3. The exploitation parameter, $p^*$, was set to 0.5.
These results are averaged over 80 separate runs for each number
of samples. The solid line plots the performance using a reactive
policy (no memory). The dashed and dotted lines are for policies
with differing amounts of memory. Notice that while the reactive
policy class converges more quickly, its optimum is much lower.
All of the other policy classes have the potential to converge to
the optimal return of 13. However, increased policy complexity
results in slower learning.

6. Discussion 

Likelihood-ratio estimation seems to show promise in using
data efficiently. The pick_sample routine we present
here is only one (simplistic) method of balancing exploitation
and exploration. More sophisticated methods, including
maintaining a distribution over the space of policies,
might allow for a better balance and the possibility of learning
a useful sampling bias in a policy space for a particular
application domain and transferring it from one learning
problem in that domain to another. In general, estimating
the variance of the proxy evaluator could aid in selecting
new samples for either exploration or exploitation.

Where REINFORCE keeps only the most recent sample, our 
algorithm keeps all of the samples. If a large amount of 
data is collected, it may be necessary to employ a method 
between these two extremes and remember a representative 
set of the samples. Deciding which samples to "forget" 
would be a difficult, but crucial, task. 



as a measure of the complexity of the policy space. Esti- 
mating the covering number is a challenging problem in it- 
self. However it would be more desirable to find a construc- 
tive solution to a covering problem in a sense of universal 
prediction theory (?;?). Obviously, given a covering num- 
ber there might be several ways to cover the space. Find- 
ing a covering set would be equivalent to reducing a global 
optimization problem to an evaluation of several represen- 
tative policies. 

Another way to use sample complexity results is to find 
what is the minimal experience necessary to be able to pro- 
vide the estimate for any policy in the class with a given 
confidence. This would be similar to the structural risk 
minimization principle of Vapnik (?). The intuition is
that given very limited data, one might prefer to search a 
primitive class of hypotheses with high confidence, rather 
than to get lost in a sophisticated class of hypotheses due to 
low confidence. 